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Abstract 

Byzantine quorum systems have been proposed that work properly even when up to / replicas fail arbitrarily. 
However, these systems are not so successful when confronted with Byzantine faulty clients. This paper presents novel 
protocols that provide atomic semantics despite Byzantine clients. Our protocols are the first to handle all problems 
caused by Byzantine clients. They prevent Byzantine clients from interfering with good clients: bad clients cannot 
prevent good clients from completing reads and writes, and they cannot cause good clients to see inconsistencies. In 
addition we also prevent bad clients that have been removed from operation from leaving behind more than a bounded 
number of writes that could be done on their behalf by a colluder. 

Our protocols are designed to work in an asynchronous system like the Internet and they are highly efficient. We 
require 3/ + 1 replicas, and either two or three phases to do writes; reads normally complete in one phase and require 
no more than two phases, no matter what the bad clients are doing. 

We also present strong correctness conditions for systems with Byzantine clients that limit what can be done on 
behalf of bad clients once they leave the system. Furthermore we prove that our protocols are both safe (they meet 
those conditions) and live. 



1 Introduction 

Quorum systems [4, 14] are valuable tools for building 
highly available replicated data services. A quorum sys- 
tem can be defined as a set of sets (called quorums) with 
certain intersection properties. These systems allow read 
and write operations to be performed only at a quorum of 
the servers, since the intersection properties ensure that 
any read operation will have access to the most recent 
value that was written. 

The original work on quorum systems assumed that 
servers fail benignly, i.e., by crashing or omitting some 
steps. More recently, researchers have developed tech- 
niques that enable quorum systems to provide data avail- 
ability even in the presence of arbitrary (Byzantine) faults [9]. 
Earlier work provides correct semantics despite server (i.e., 
replica) failures and also handles some of the problems of 
Byzantine clients [9, 10, 5, 1, 2, 11, 12]. 

This paper extends this earlier work in two important 
ways. First, it defines new protocols that handle all prob- 
lems of Byzantine clients. In addition, our protocols are 
more efficient than previous proposals, in terms of both 
operation latency and number of replicas. Second, the 
paper states correctness conditions for such systems and 
proves that protocols are correct. The correctness condi- 
tions are stronger than what has been stated previously [11] 
and what has been guaranteed by previous approaches. 

Since a dishonest client can write garbage into the shared 
variable, it may seem there is little value in limiting what 
bad clients can do. But this is not the case, for two reasons. 
First, bad clients can cause a protocol to misbehave so that 
good clients are unable to perform operations (i.e., the pro- 
tocol is no longer live) or observe incorrect behavior. For 
example, if the variable is write-once, a good client might 
observe that its state changes multiple times. 

Second, bad clients can continue to interfere with good 
ones even after they have been removed from operation, 
e.g., by a system administrator who learns of the misbe- 
havior. We would like to limit such interference so that, 
after only a limited number of writes by good clients, any 
lurking writes left behind by a bad client will no longer be 
visible to good clients. A lurking write is a modification 
launched by the bad client before it was removed from 
operation that will happen (possibly with help from an ac- 
complice) after it has left the system. By limiting such 
writes we can ensure that the object becomes useful again 
after the departure of the bad client, e.g., some invariant 
that good clients preserve will hold. 

Of course, it is not possible to prevent actions by a bad 



client, even if it has been shut down, if the private key it 
uses to prove that it is authorized to modify an object can 
be used by other nodes; thus, a bad client is in the system 
as long as any node knows its private key. If nodes can 
expose their private keys, we can make statements about 
behavior in the absence of a bad client only when all faulty 
nodes have been removed from operation [11] - a condi- 
tion that is not very useful in practice. A more practical ap- 
proach is make use of secure cryptographic coprocessors 
(like the IBM 4758 [7]) that allow signing without expos- 
ing the private key. Our correctness condition is stated in 
a way that allows either model about the lifetime of faulty 
nodes in the system. 

Thus, we make the following contributions: 

• We present strong correctness conditions for an atomic 
read/write object in a system in which both clients 
and replicas can be Byzantine. Our conditions en- 
sure atomicity for good clients, and also limit the 
effects of bad clients that have been removed from 
operation: our conditions bound the number of lurk- 
ing writes by a constant factor, and prevent lurking 
writes after bad clients stop and good clients subse- 
quently overwrite the data. 

• We present the first Byzantine quorum protocols that 
satisfy the conditions using only 3/ + 1 replicas (to 
survive / faulty replicas) and work in an unreliable 
asynchronous network like the Internet. Further- 
more our protocols are efficient: To do writes re- 
quires either 3 phases (our base protocol) or mostly 
2 phases (our optimized protocol). Reads usually 
require 1 phase; they sometimes need an additional 
phase (to write back the data just read). Our base 
protocol ensures that there can be at most one lurk- 
ing write after a bad client has left the system; the 
optimized protocol ensures slightly weaker behav- 
ior: there can be a most two lurking writes. 

• We are able to manage with a small number of repli- 
cas and phases because our protocol makes use of 
"proofs", an approach that we believe can be used 
to advantage in other protocols. A proof is a col- 
lection of 2/ + 1 certificates from different replicas 
that vouch for some fact, e.g., that a client has com- 
pleted its previous write, or that the state read from 
a single replica is valid. 

• We prove the correctness of our protocols, both safety 
and liveness. In addition we describe a variation of 



our protocols that supports a still stronger correct- 
ness condition, in which we can bound the number 
of writes of good clients that it takes to ensure that 
modifications of the bad client that has left the sys- 
tem will no longer be visible. This variation some- 
times requires an additional phase to do a write. 

The rest of this paper is organized as follows. We be- 
ing by describing our assumptions about the system. Sec- 
tion 3 describes our base protocol. Section 4 describes 
our correctness conditions and we prove our base protocol 
meets them in Section 5. Section 6 describes our opti- 
mized protocol and proves its correctness. Section 7 de- 
scribes a variation of our protocols that allows us to bound 
the number of overwrites needed to hide effects of lurking 
writes. Section 8 discusses related work and we conclude 
in Section 9. 

2 Model 

The system consists of a set C = {ci, ...,c n } of client 
processes and a set S = {s±, ..., s n } of server processes. 
Client and server processes are classified as either correct 
or faulty. Correct processes are constrained to obey their 
specification, i.e., they follow the prescribed algorithms. 
Faulty processes may deviate arbitrarily from their spec- 
ification, i.e., we assume a Byzantine failure model [8]. 
Note that faulty processes include those that fail benignly, 
and those that are initially correct and fail at some point in 
the execution of the system. 

We refer to the set of faulty clients as C^ a ^ and the 
set of correct clients as C ok , and consequently we have 
C = C bad ljC ok ( and respectively S = S bad US ok ). Note 
that our algorithms do not require the knowledge of C ok , 

^bad' ^ok' nor ^bad- in otner words, we do not assume 
that we can detect faults. 

We assume an asynchronous distributed system where 
nodes are connected by a network that may fail to deliver 
messages, delay them, duplicate them, corrupt them, or 
deliver them out of order, and there are no known bounds 
on message delays or on the time to execute operations. 
We assume the network is fully connected, i.e., given a 
node identifier, any other node can (attempt to) contact 
the first node directly by sending it a message. 

For liveness, we require that if a message is retransmit- 
ted repeatedly it will eventually be delivered. Note that we 
only require this for liveness; safety does not depend on 
any assumptions a about message delivery. 

We assume nodes can use unforgeable digital signa- 
tures to authenticate communication. More precisely, any 



node, n, can authenticate messages it sends by signing 
them. We denote a message m signed by n as (m) (Tn . 
And (with high probability) no node can send (m) a „ (ei- 
ther directly or as part of another message) on the network 
for any value of m, unless it is repeating a message that 
has been sent before or it knows n's private key. We call 
the set of possible signatures S. These signatures can be 
verified with public keys in a set P. 

We also assume the existence of a collision-resistant 
hash function, h, such that any node can compute a digest 
h(m) of a message m and (with high probability) it is im- 
possible to find two distinct messages m and m' such that 
h{m) = h(m'). 

These assumptions are probabilistic but there exist sig- 
nature schemes (e.g., RSA [13]) and hash functions (e.g., 
SHA-1 [3]) for which they are believed to hold with very 
high probability. Therefore, we will assume they hold 
with probability one in the rest of the paper. 

To avoid replay attacks we tag certain messages with 
nonces that are signed in the replies. We also assume that 
when clients pick nonces they will not choose a repeated 
nonce with probability one. 

3 BFT-BC Algorithm 

This section presents our construction for a read/write vari- 
able implemented using Byzantine quorum replication, and 
that tolerates Byzantine-faulty clients. We begin by giv- 
ing a brief overview of Byzantine quorums in Section 3.1. 
Then we present our base protocol. This protocol requires 
3 phases to write; in Section 6 we present an optimization 
that requires only 2 phases most of the time. 

3.1 Byzantine Quorums Overview 

This section gives an overview of how current algorithms 
use Byzantine quorums to implement a shared read/write 
variable. This presentation follows the original BQS pro- 
tocol [9], using their construction for a system that doesn't 
handle Byzantine clients. (We discuss this system further 
in Section 8.) 

A Byzantine quorum system defines a set of subsets 
of a replica group with certain intersection properties. A 
typical way to configure such a system is to use groups of 
3/ + 1 replicas to survive / failures with quorums of size 
2/ + 1 replicas. This ensures that any two quorums inter- 
sect in at least one non-faulty replica. Each of the replicas 
maintains a copy of the data object, along with an asso- 
ciated timestamp, and a client signature that authenticates 
the data and timestamp. 

Two phases are required to write the data. First, the 



client contacts a quorum to obtain the highest timestamp 
produced so far. The client then picks a timestamp higher 
than what was returned in the first phase, signs the new 
value and timestamp, and proceeds to the second phase 
where the new value is stored at a quorum of replicas. 



For example, the purpose of the prepare phase is for 
the client to inform the replicas of what it intends to do, 
and this must be acceptable, e.g., the timestamp it pro- 
poses can't be too big. Replicas that approve the prepare 
request return certificates that together provide a "prepare 



Replicas allow write requests only from authorized clients, proof". This proof is needed to carry out the write phase: 



A replica overwrites what it has stored only if the times- 
tamp in the request is greater than what it already has. 

The read protocol usually has a single phase where the 
client queries a quorum of replicas and returns the value 
with the highest timestamp (provided the signature is cor- 
rect). An extension of this protocol [10] requires a second 
phase that writes back the highest value read to a quorum 
of replicas (this ensures atomic semantics for reads). 

3.2 BFT-BC Protocol 



a client can only carry out a write that has been approved. 
If a write is allowed by a replica it returns a certificate and 
these certificates together form a "write proof". This proof 
is needed for the replica to do its next write: it cannot do 
a second write without completing its first one. This con- 
straint, plus the fact that proofs cannot be forged or pre- 
dicted in advance by bad clients, is what limits the num- 
ber of lurking writes a bad client can leave behind when 
it stops (to be carried out by some node that colludes with 



it). 
The protocol just presented is not designed to handle Byzantine- We nQW describe Qur protQCol in detail 



faulty clients, which can cause damage to the system in 
several ways, e.g.: 

1 . Not follow the protocol by writing different values 
associated with the same timestamp. 

2. Only carry out the protocol partially, e.g., install a 
modification at just one replica. 

3. Choose a very large timestamp and exhaust the time- 
stamp space. 

4. Issue a large number of write requests and hand 
them off to a colluder who will run them after the 
bad client has been removed from the system. This 
colluder could be one of the replicas, or a com- 
pletely separate machine. 

To avoid these problems, we need to limit what bad 
clients can do. But we must also allow good clients to 
make progress, in spite of bad clients that might collude 
with one another and/or with bad replicas. 

Our protocol accomplishes these goals. It uses 3/ + 1 
replicas, and quorums can be any subset with 2/ + 1 repli- 
cas. It uses a three-phase protocol to write, consisting of a 
read phase to obtain the most recent timestamp, a prepare 
phase in which a client announces its intention to write a 
particular timestamp and value, and a write phase in which 
the client does the write that it prepared previously. 

As it moves from one phase to another, however, the 
client needs to "prove" that what it is doing is legitimate. 
A proof takes the form of a quorum of certificates from 
different replicas that vouch for some fact. 



As mentioned, each object in BFT-BC is replicated at 
a set of 3/ + 1 replicas, numbered from to 3/. Quorums 
can be any subset with 2/ + 1 replicas and we use a three- 
phase protocol to write. 

Each replica maintains the following per-object infor- 
mation: 

• data, the value of the object. 

• current-ts, the associated timestamp. 

• prepare-proof, a valid prepare proof for above times- 
tamp and a hash of the value. 

• prepare-list , a list containing the timestamp, hash, 
and proposing client of currently prepared writes. 

• write-ts, the timestamp of the latest write known to 
have completed at 2/ + 1 replicas. 

In the remainder of this paper we simplify the presen- 
tation by considering a system containing only a single 
object, and therefore we omit object identifiers from the 
description of the protocol. 

3.2.1 Write Protocol 

Our protocols require that different clients choose differ- 
ent timestamps, and therefore we construct timestamps by 
concatenating a sequence number with a client identifier: 
ts = (ts.val,ts.id). We assume that client identifiers are 
unique. To increment a timestamp a client with identifier c 
uses the following function: succ(ts, c) = (ts.val + 1, c). 



Timestamps can be compared in the usual way, by com- 
paring the val parts and if these agree, comparing the 
client ids. 
Client Processing 

To write a data object, clients go through a three-phase 
protocol. In all phases, clients retransmit their requests to 
account for lost messages; they stop retransmitting once 
they collect a quorum of valid replies. 
Phase 1. The client, c, sends a READ-TS request to all 
replicas; the request contains a nonce to avoid replays. 
The replies each include a prepare proof for the timestamp 
being returned. The client waits for a quorum of correctly 
authenticated replies. It selects the largest timestamp t max 
it received, computes a new timestamp t' = succ(t max , c), 
and moves to phase 2. 

Phase 2. The client sends PREPARE requests to all repli- 
cas. The request contains the new timestamp, the prepare 
proof obtained in phase 1, a hash of the value it intends 
to write, and a write proof (for the most recent write done 
by this client, or null if this is the first write done by this 
client), all authenticated by the client. The client waits for 
a quorum of PREPARE-ACKs, and then moves to phase 3; 
the 2/ + 1 responses it received in phase 2 constitute a 
prepare proof for the write. 

Phase 3. The client sends a WRITE request containing the 
prepare proof along with the new value, and waits for a 
quorum of valid replies, all authenticated by the client. 

Each reply contains a signed statement vouching for 
the write for that timestamp and the replica id of the sender. 
A vector of 2/ + 1 such statements constitutes a write 
proof for the current operation. 
Processing at replicas 

Phase 1. Replicas reply to a READ-TS request by sending 
the prepare proof for their current timestamp and value, 
plus the nonce, all authenticated by the replica. 
Phase 2. Phase 2 processing is the crucial part of the al- 
gorithm. The replicas check to ensure that the timestamp 
being proposed is reasonable, that the client is doing just 
one prepare, that the value being proposed doesn't differ 
from a previous request for the same timestamp, and that 
the client has completed its previous write. 

The replica discards the message if it is invalid: if the 
prepare or write proof can't be validated, the timestamp 
doesn't have the client id in the low-order part, the pro- 
posed timestamp isn't the successor of the timestamp in 
the prepare proof, or the client isn't authorized to write 
the object. 

Next, the replica uses the write proof to update its write-ts: 
it sets this to the larger of the timestamp in the write proof 



it just received, and what it already has stored. Then it re- 
moves any entry from its prepare-list whose timestamp is 
less than or equal to the write-ts. 

If there is still an entry in the prepare-list for the client, 
the replica discards the request unless that entry matches 
what is being proposed (both the timestamp and the hash). 

If the replica accepts the request it updates the prepare- 
list by adding the timestamp and hash for this client (un- 
less this information is already in the list). Then it returns 
a PREPARE- ACK message containing a signed certificate 
for the new timestamp and hash. 

Phase 3. Replicas carry out the request only if it is valid: 
the prepare proof must be valid and the new value must 
match the hash in the proof. Replicas respond to valid re- 
quests by sending a WRITE- ACK containing a signed cer- 
tificate for the new timestamp. The replica additionally 
modifies its state if the timestamp in the request is greater 
than its current timestamp; in this case it stores the pro- 
vided information as the current value, timestamp, and as- 
sociated prepare proof. 

3.2.2 Read Protocol 

The read protocol usually requires just one phase. 
Phase 1. The client sends a READ request to all replicas; 
the request contains a nonce. A replica replies with its 
value, timestamp, prepare proof, and nonce all signed by 
it. The client waits for a quorum of valid responses and 
chooses the one with the largest timestamp (this is the re- 
turn value). If all the timestamps are the same the read 
protocol ends. 

Phase 2. Otherwise the client goes on to the write-back 
phase for the largest timestamp; this is identical to phase 3 
when a client is doing a write, except that the client needs 
to send only to replicas that are behind, and it must wait 
only for enough responses to ensure that 2/ + 1 replicas 
now have the new information. 

3.3 Discussion 

In the above description we mentioned that certain mes- 
sages or statements were authenticated, but the kind of 
authentication that may be used was unspecified. This 
issue is important since different techniques have differ- 
ent costs: we can authenticate a point-to-point message by 
means of symmetric cryptography by establishing session 
keys and using message authentication codes (MACs). This 
does not work for signing statements that have to be shown 
to other parties, in which case we need to rely on more ex- 
pensive public key cryptography. 



Our protocol requires signing using public key cryp- 
tography in two places: the phase 2 and phase 3 responses. 
These signatures are needed because they are used as proofs 
offered to third parties, e.g., the prepare proof is generated 
for one client but then used by a different client to justify 
its choice of the next timestamp. 

A further point is that only the phase 2 response sig- 
nature needs to happen in the foreground. The signature 
for the phase 3 response can be done in the background: 
a replica can do this after replying to the phase 2 request, 
so that it will have the signature ready when the phase 3 
request arrives. 

4 Correctness Condition 

This section defines the correctness conditions for a vari- 
able shared by multiple clients that may incur Byzantine 
failures. We begin by defining histories (Section 4.1), and 
then we give our correctness condition. 

4.1 Histories and Stopping 

We use a definition of history similar to the one proposed 
in [6], extended to handle Byzantine clients. 

The execution is modeled by a history, which is a se- 
quence of events of the following types: 

• Invocation of operations. 

• Response to operation invocations. 

• Stop events. 

An invocation by a client c is written (c : x.op) where 
x is an object name and op is an operation name (pos- 
sibly including arguments). A response to c is written 
(c : x.rtval) where rtval is the response value. A re- 
sponse matches an invocation if their object names agree 
and their client names agree. A stop event by client c is 
written (c : stop). 

A history is sequential if it begins with an invocation, if 
every response is immediately followed by an invocation 
(or a stop or no event), and if every invocation is imme- 
diately followed by a matching response. A client subhis- 
tory H\c of a history H is the subsequence of all events in 
H whose client names are c. A history is well-formed if 
for each client c, H\c is sequential. We use Ji to denote 
the set of well-formed histories. 

A object subhistory history H\x is the subsequence of 
all events whose object names are x and a history H is 
a single-object history for some object x if H\x = H. A 
sequential specification for an object is a prefix-closed set 



of single-object sequential histories for that object. A se- 
quential history H is legal if each object subhistory H\x 
belongs to the sequential specification of x. 

An operation o in a history is a pair consisting of an 
invocation inv(o) and the next matching response rsp(o). 
A history H induces an irreflexive partial order <h on 
the operations and stop events in H as follows: oq <h 
o\ if and only if rsp(oo) precedes inv(o\) in H; oq <h 
(c : stop) if and only if rsp(oo) precedes (c : stop) in 
H; (c : stop) <h 0\ if and only if (c : stop) precedes 
inv (oi) in H; and (c\ : stop) <h (c2 : stop) if and only 
if (ci : stop) precedes (c2 : stop) in H. 

4.1.1 Verifiable Histories 

In [6], the notion of linearizability is applied to a concur- 
rent computation, which is modeled by a history (a finite 
sequence of invocation and response events by all pro- 
cesses). 

It would be difficult to model such computations in our 
environment, since faulty processes do not obey any spec- 
ification, and therefore we cannot define what an invoca- 
tion or a response means for such processes. However, we 
know that after a STOP event from a faulty process it will 
halt. 

Therefore we introduce the concept of a verifiable his- 
tory which is a history that contains the sequence of invo- 
cations and responses from correct clients, and stop events 
from faulty clients. 

The correctness condition we present next is applicable 
only to verifiable histories. 

4.2 Correctness Condition 

We are now ready to define the correctness condition for 
variables shared by multiple processes that may incur Byzan- 
tine failures. 

The idea is that, as in linearizability [6], we require 
that the verifiable history look plausible to the correct pro- 
cesses, i.e., that there is a sequential history in which all 
processes are correct that explains what the correct pro- 
cesses observed. Furthermore, we require that once a faulty 
client stops, its subsequent effect on the system is limited 
in the following way: The number of operations by that 
client that are "seen" by correct clients after it stopped is 
bounded by some constant. 

These concepts can be formalized as follows. 
Definition 1 A verifiable history H £ His BFT-linearizable 
if there exists some legal sequential history H' £ Ji such 
that 



1. H\p = H'\py P €C ok 

2. <H<^<H> 

3. Vc G Q, a j : (c : stop) £ H => 

(3/ii, h 2 C H' : H' = hi (c : stop) /i 2 A 
|{o G h 2 : o = (c : x.op)}\ < max-b) 



Points 1 and 2 of the definition above state that there 
must exist a sequential history that looks the same to cor- 
rect processes as the verifiable history in question, and that 
sequential history must preserve the <h ordering (in other 
words, if an operation or a stop event precedes another op- 
eration or stop event the verifiable history, then the prece- 
dence must also hold in the sequential history). 

Point 3 says that if all faulty clients stop, then when 
we look at the sequential history and consider the sub- 
history after the last stop event by faulty clients (this is 
h 2 ), the number of events by faulty clients in that history 
is bounded by max-b. 

Note that there is a counter-intuitive notion that a bad 
client can execute operations after it has stopped. This cor- 
responds to the notion that the bad client left some pending 
operations (e.g., with a colluder) before it left, and this is 
precisely the effect we want to minimize. 

We can now define a BFT-linearizable object to be an 
object whose verifiable histories are BFT-linearizable with 
respect to its sequential specification. 

5 Correctness Proof 

This section sketches a proof that the algorithm presented 
in Section 3 meets the correctness conditions presented in 
Section 4. 

The idea is to look at what kind of properties we can 
ensure given a certain state of the system when a faulty 
client c stops. If we look at the current-ts values stored at 
that instant, we can guarantee the following. 
Lemma 1. Let ts ma x be the / + 1st highest timestamp 
stored by non-faulty replicas at time t s t p (when some 
faulty client c stops). Then the following must hold 

1. At any time up to t s t n, no node can collect a write 
proof for a timestamp t such that t' > ts max . 

2. There are no two timestamps ti, t 2 > ts max such that 
client c assembled a prepare proof for t\ and t 2 . 

3. No two prepare proofs exist for the same timestamp 
t > tsmax and different associated values. 

Proof. 



1. By algorithm construction, a nonfaulty replica will not 
sign an entry in a write proof vouching for a times- 
tamp higher than the one held in the variable current-ts. 
Since non-faulty replicas always store increasing times- 
tamp values, this means that the number of signatures 
that can be held in the system at time t s top for times- 
tamps higher than ts ma x is at most 2/ (i.e., / from 
faulty replicas and the / correct replicas that may hold 
timestamps higher than ts max ). 

2. By contradiction, suppose that client c held prepare 
proofs for t\,t 2 , both greater than ts ma x- The two 
proofs intersect in at least one nonfaulty replica. By 
part (1) of this lemma, that replica had its write-ts vari- 
able always at a value less than or equal to ts max (at 
all times up to and including t s t p). Therefore that 
replica could not have signed both proofs, since after 
signing the first one (say, for t\) it would insert an 
entry for client c and t\ in its prepare-list, and that 
entry would not be removed (because of the value of 
write-ts), which prevents the replica from signing the 
proof for t 2 . 

3. Suppose, by contradiction that two prepare proofs exist 
for timestamp t and values v and v'. By the quorum in- 
tersection properties, these two prepare proofs contain 
at least one signature from the same correct replica. By 
part (1) of this lemma, no write proof was ever assem- 
bled for timestamps greater or equal than ts ma x, and 
these entries in the prepare list were never removed. 
This violates the constraint that correct replicas do not 
sign a timestamp that is already in its prepare list for a 
different value. 

D 

We are now ready to show the correctness of our algo- 
rithm. 

Theorem 1. The BFT-BC algorithm is BFT-linearizable. 
Proof. Consider any correct reader in the case that the 
writer is correct. In this case, the quorum initially ac- 
cessed in a read operation intersects the quorum written 
in the most recently completed write operation in at least 
one correct replica. Therefore, the read returns either the 
value in the most recently completed write, or a value with 
a higher timestamp (which could be written concurrently 
with the read). Since a read also writes back its obtained 
value and timestamp to a quorum of processes, any subse- 
quent read by a correct reader will return that timestamp 
value or a later one. So, for any execution, we construct 
the history needed to show BFT-linearizability by putting 
every read right after the write whose value it returns. 



If the writer is faulty, we construct the sequential his- 
tory to show BFT-linearizability as follows: for each read 
by a correct reader returning v such that the phase 3 re- 
quest for v was produced by client q,, insert a write opera- 
tion in the history that writes v (by client c&) immediately 
before the read. 

Insert a stop event before the invocation of the first op- 
eration that succeeded the stop event in the original (veri- 
fiable) history (i.e., as late as possible while preserving the 
<H dependencies). 

It is left to show that if a faulty process stops, this 
history contains a bounded number of operations by that 
faulty process after it stops. 

To prove this condition we note that Lemma 1 part (2) 
says that client c only assembled prepare proofs for a sin- 
gle timestamp for a write that would become visible after 
it stopped (i.e., with a timestamp greater than ts max ), and 
Lemma 1 part (3) implies that if the write were to become 
visible, the prepare proof could only correspond to a sin- 
gle value. 

This means the number of operations by the faulty client 
c after it stops is at most one. 

5.1 Liveness 

As far as liveness is concerned, we guarantee that good 
clients can always execute read operations in the time it 
takes for two client RPCs to complete at 2/ + 1 replicas 
(i.e., the requests reaching the replicas and the respective 
replies returning to the client). This is so because at least 
2/ + 1 replicas will provide them with the appropriate an- 
swers (client requests are answered unconditionally pro- 
vided the requests are well-formed). For the write proto- 
col, the operation will complete in the time for three client 
RPCs to complete in 2/ + 1 replicas, and the trick is to 
ensure that their phase 2 requests are never refused (since 
phase 1 requests are answered unconditionally and phase 3 
requests are answered if they contain a valid prepare proof, 
which good clients will always send). 

The phase 2 request also get replies since the client will 
submit the write proof for its latest write alongside with it, 
which will force the replica to discard the client's entry in 
the prepare list and accept the prepare request. 

6 Optimized BFT-BC Algorithm 

This section describes our optimized protocol, which usu- 
ally requires only two phases to do a write, and proves that 
the optimized protocol satisfies our correctness condition. 



6.1 Optimized BFT-BC 

Our plan is to avoid one of the phases by merging it with 
another phase. We cannot merge phases 2 and 3 since the 
protocol requires that the prepare information be stored at 
/ + 1 honest replicas before the client can do phase 3: this 
is how we avoid a bad client being able to create many 
writes in advance that could be launched after it leaves the 
system. Therefore, we will merge phases 1 and 2. 

The idea is that the client sends the hash in the phase 
1 request and the replica does phase 2 on its behalf: it 
predicts the next timestamp, adds the new timestamp with 
the hash to the prepare-list, and returns this information, 
signed by itself. If the client receives a quorum of re- 
sponses all for a particular (highest) new timestamp, it can 
move to phase 3 immediately. 

This optimization will work well in the normal case 
where a write is actually received by all replicas. There- 
fore the good client is highly likely to receive a quorum of 
acknowledgments for the same timestamp in the responses 
in phase 1 , and most of the time a write will require two 
phases. 

However there are problems that must be overcome for 
the optimization to be used. They occur when there are 
concurrent writes or when other clients are faulty. 

A simple example is the following. Suppose a client 
does phase 1 for some hash h. It receives PREPARE- ACKS 
for a particular timestamp t from replicas Rl and R2, and 
PREPARE- ACKS for a larger timestamp t' from replicas R3 
and RA. The client will then be unable to get approval 
for either t or t', because to approve the request a replica 
would need to violate our restriction on having at most one 
entry per client on the prepare-list. 

Clearly we need to find a way to allow clients to do 
writes in cases like this one. But given our current con- 
straints this is not possible. 

To allow the client to make progress we will weaken 
our constraint on how many entries a client can have in the 
prepare list. In particular we will change the replica state 
to have a second prepare list, opt-list, containing entries 
produced by the replica, with the constraint that a client 
can have at most one entry per list. The two entries for 
the same client (one in each list) might be for different 
timestamps, or they might be for the same timestamp. In 
the latter case, it will be possible for the write with that 
timestamp to happen twice (when there is a bad client). In 
this case we need a way to order the writes: we will do 
this by using the numeric order on their hashes. 



6.2 Detailed Write Protocol 

Phase 1. The client, c, sends a READ-TS-PREP request 
containing the hash of the proposed value, and the client's 
current write proof, to all replicas. 

Replicas process the request in the usual phase-2 way, 
including removing entries from the prepare-lists. Then it 
will do the prepare on behalf of the client for timestamp 
i! = succ(ts,c), unless the client already has an entry in 
either prepare list for a different timestamp or hash. If 
the prepare is successful, the replica adds an entry to the 
opt-list for t' and the hash (unless that entry is already in 
the list) and returns a PREPARE-ACK signed by it; other- 
wise it returns a normal phase 1 response. 

If the client gets a quorum of PREPARE-ACKS (i.e., 
obtains a prepare proof) for the same new timestamp, it 
moves to phase 3. 

Phase 2. Otherwise the client chooses the new timestamp 
as before and carries out phase 2 for it (as in the normal 
protocol). When it has a quorum of signatures for this 
choice (obtained either in phase 1 or phase 2), it moves to 
phase 3. 

Replicas handle phase 2 requests as in the normal pro- 
tocol; they ignore the opt-list in this processing. 
Phase 3. The client does phase 3 in the normal way. 

The replica also does normal processing, except that 
the timestamp in the write might match the current times- 
tamp, but with a different value: in this case it retains the 
value with the larger hash. 

6.3 Discussion 

The optimized protocol can lead to a client having entries 
on two prepare lists (normal and optimized). A dishonest 
client can exploit this to carry out phase 3 twice as part 
of the same write request. And as a result, each dishon- 
est client can leave two lurking writes behind when it is 
removed from the system. 

Another point is that it is now possible for honest clients 
to see valid responses to a read request that have the same 
timestamp but different values. The client protocol re- 
solves this situation by returning (and writing back) the 
value with the larger hash. 

6.4 Optimized BFT-BC Correctness Proof 

For the optimized protocol some of the invariants shown 
in Section 5 need to be slightly modified. Notably, Lemma 
1 parts (2) and (3) no longer hold, since a faulty client can 
now collect two distinct prepare proofs (for its entries in 
prepare-list and opt-list). Therefore Lemma 1 parts (2) 
and (3) become: 



Lemma 1'. 

(2 + 3). For the optimized BFT-BC protocol, no more than 
two prepare proofs can exist for distinct timestamp, value 
pairs, with timestamps greater than ts ma %- 
Proof. Same as Lemma 1, but taking into consideration 
that the new algorithm allows for one entry in the normal 
prepare list, and another in the optimistic prepare list. 

D 

This only affects the proof of Theorem 1 in that the 
number of operations by a faulty client after it stops be- 
comes 2 instead of 1 , and therefore the main theorem still 
holds: 

Theorem 2. The optimized BFT-BC algorithm is BFT- 
linearizable. 

7 Stronger Correctness Conditions 

In this section we propose a stronger correctness condition 
than the one introduced in Section 4, and discuss how to 
extend the current protocol to meet the new condition. 

7.1 New Correctness Conditions 

The idea is that we want to further limit the effects a bad 
client can have on the system once it stopped such that, 
if correct clients execute a certain number of write opera- 
tions after a faulty client stops, then no more operations by 
that client will ever be "seen". (We can generalize this to 
any state-overwriting operation in a set overwr j te in case 
the variable has an interface other than read and write.) 

We formalize the BFT-linearizable+ condition as be- 
ing equal to the condition presented in Section 4, except 
that point (3) is modified to state that no operations by the 
faulty client c appear after the kth state-overwriting oper- 
ation in li2- 

7.2 Modified BFT-BC protocol 

To meet this stronger correctness condition, we merely 
need to modify the BFT-BC protocol to require the client 
to submit a write proof in its prepare request (along with 
the information it already sends in the prepare). This proof 
certifies that the timestamp it is proposing is the successor 
of one that corresponds to a write that has already hap- 
pened; a replica will discard the request if this condition 
doesn't hold. 

The client can easily assemble this proof in phase 1 
if all responses to its phase 1 request contain the same 
timestamp. If they don't the client can obtain the proof 
by doing a write-back; the difficulty is getting the value to 
write back. This could be accomplished by having phase 
1 actually be a read, but this is unattractive since values 



can be large. So instead, the client fetches the value, if 
necessary, in a separate step; effectively it redoes phase 
1 as a normal read, although this can be optimized (e.g., 
to fetch from the / + 1 replicas that returned the largest 
timestamps). 

This scheme guarantees that the timestamp in the lurk- 
ing write is the successor of a lower bound on the value 
stored by at least / + 1 non-faulty replicas when the bad 
client stopped. Consequently, if there were two consecu- 
tive writes by correct clients after the bad client stopped, 
the lurking write would no longer be seen. 

8 Related Work 

In this section we discuss the previous work that dealt with 
bad clients, and also the previous work on correctness con- 
ditions. 

Our protocol is more efficient than those proposed pre- 
viously. Furthermore it enforces stronger constraints on 
the behavior of bad clients: we handle all problems han- 
dled by previous protocols plus we limit the number of 
lurking writes that a bad client can leave behind after it 
has left the system. 

In addition we provide stronger safety and liveness guar- 
antees than previous protocols: in particular read opera- 
tions cannot return null values, and reads terminate in a 
constant number of rounds, independently of the behavior 
of concurrent writers. 

Our use of prepare proofs is what enables us to ensure 
liveness, since this allows a client to complete a read using 
information it received from just one replica. No previous 
protocol uses this technique. As a result they can only en- 
sure that a read will return the result of the latest write in 
a constant number of rounds provided there are no con- 
current writes. Yet concurrent writes are a real issue in a 
system with Byzantine clients, since bad clients can use 
this to prevent good clients from making progress: they 
can either write continually, or they can write partially (not 
complete their writes), and either of these causes problems 
for readers. 

The previous proposals fall into two categories: those 
that ensure that at most one value can be associated with 
a timestamp, and those that enforce non-skipping times- 
tamps. 

The initial work on Byzantine quorum systems [9] de- 
scribed the quorum consensus protocol discussed in Sec- 
tion 3, which used 3/+ 1 replicas, one phase reads and two 
phase writes, and did not handle Byzantine clients. That 
paper also described a protocol for a system of 4/ + 1 
replicas that prevented malicious clients from associating 



different values with different timestamps; the protocol re- 
quired a three-phase communication among the replicas to 
be carried out as part of performing a write (in addition to 
the client-server communication) where servers guarantee 
that each value and timestamp pair is propagated to a quo- 
rum. The protocol provides liveness, but at the expense 
of providing weak semantics for reads where they could 
return a null value in case of concurrent writes. 

The Phalanx system [10] improves on the previous re- 
sult in two ways. First it added the write-back phase (to the 
simpler protocol) as a way to ensure atomicity for reads. 
In addition it presented a more efficient protocol to handle 
Byzantine clients. That protocol used 4/ + 1 replicas, but 
clients carried out a three-phase protocol to do a write, and 
the server-to-server communication was no longer needed. 
In the new protocol for Byzantine clients, read operations 
could return a null value if there was an incomplete or a 
concurrent write. 

The work by Goodson et al. [5] proposes a solution for 
erasure-coded storage that tolerates Byzantine failures of 
clients and servers; this system requires 4/ + 1 replicas. 
This work is focused on integrating erasure coded storage 
with Byzantine quorum protocols and they tolerate Byzan- 
tine clients that write different "values" to different repli- 
cas by having the next reader detect the inconsistency and, 
if possible, repair it. In some cases, reads may return a null 
value. 

The work of Martin et al. [12] proposes a protocol that 
only uses 3/ + 1 replicas (like our protocol). They re- 
quire a quorum of 2/ + 1 identical replies for read oper- 
ations to succeed, which is difficult to ensure in an asyn- 
chronous system. Their solution is to assume a reliable 
asynchronous network model, where each message is de- 
livered to all correct replicas. This means that infinite re- 
transmission buffers are needed in an asynchronous envi- 
ronment like the Internet: the failure of a single replica 
(which might just have crashed) causes all messages from 
that point on to be remembered and retransmitted. In this 
protocol concurrent writers can slow down readers. The 
authors discuss how to extend their protocols to guaran- 
tee that Byzantine clients associate the same value with 
the same timestamp by having servers forward the writes 
among themselves (again, this is possible due to their net- 
work model) and keep the highest value for each times- 
tamp. They also discuss how to prevent a faulty client 
from exhausting resources at correct servers, but at the ex- 
pense of possibly sacrificing liveness for correct readers. 

Two papers (Bazzi and Ding [1], Cachin and Tessaro [2]) 
describe protocols that enforce non-skipping timestamps. 



In both cases, the protocols are based on the Martin et 
al. work, which uses the reliable asynchronous network 
model with the problems we mentioned above. The proto- 
col of Bazzi and Ding [1] requires 4/ + 1 replicas. Cachin 
and Tessaro [2] require 3/ + 1 replicas; an added prob- 
lem in this system is the use of threshold signatures to 
vouch for an increment on the current timestamp (since 
these signatures are very expensive). They also employ an 
expensive reliable broadcast mechanism for client-server 
communication. 

Our definition of BFT-linearizability presented in Sec- 
tion 4 builds upon a condition (called Byznearizability) 
proposed by Malkhi et al. [11]. This work was the first 
to point out the problem of lurking writes. However, their 
correctness condition is weaker than ours, since they re- 
quire only that the number of lurking writes is finite, whereas 
we require that the number of lurking writes is bounded 
by a constant. In fact, their correctness condition is met 
by the variant of the Phalanx protocol [10] that was de- 
signed to handle only honest clients. Another point is that 
this paper did not consider the use of secure co-processors 
to prevent malicious clients from leaking their private key, 
and therefore their condition requires all faulty clients and 
servers to stop in order to provide any guarantees. 

Our condition is also stronger due to the fact that we 
consider the number of overwrites to mask all the lurking 
writes (see Section 7). 

9 Conclusions 

This paper has presented novel protocols for Byzantine 
quorum systems that provide atomic semantics despite Byzan- 
tine faulty clients. Ours is the first protocol to fully address 
the complete range of problems that can arise with Byzan- 
tine clients. We prevent bad clients from leaving behind 
more than one or two lurking writes and from exhausting 
the timestamp space. In addition Byzantine clients are un- 
able to interfere with good clients in the sense that they 
cannot prevent good clients from completing reads and 
writes, and they cannot cause good clients to see incon- 
sistencies. Another point is that an extension of our pro- 
tocol can additionally ensure that the effects of a Byzan- 
tine client are no longer visible to good clients at all after 
two successive writes by good clients (or four successive 
writes in the optimized protocol). 

We also presented strong correctness conditions that 
address the above problems: we require that protocols 
guarantee atomicity for good clients, and limit what can 
be done on behalf of bad clients once they leave the sys- 
tem. Furthermore we proved that our protocols are both 



safe (they meet those conditions) and live. 

Our protocols are designed to work in an asynchronous 
system like the Internet and they are highly efficient. Our 
base protocol completes writes in three network round- 
trips; the optimized protocol reduces this cost so that writes 
normally complete in two network round-trips (at the ex- 
pense of allowing one more lurking write). In either case 
reads normally complete in one phase, and require no more 
than two phases, no matter what the bad clients are do- 
ing. We achieve these efficiencies because of our use of 
"proofs", which allow clients or replicas to know that in- 
formation presented to them is valid, without having to 
hear the same thing directly from 2/ + 1 replicas. 
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